source

Celebrating Black Lives

In years gone by, Black History Month has been the only time of year when people talk about the achievements of Black People and recognise their contributions.

We want to change this by analysing African-American Firsts, that have historically marked footholds. These breakings of the colour barrier across a wide range of topics have often led to widespread cultural change.

In the following work, we will

  • scrape data from Wikipedia, where a list of African-American Firsts can be found
  • enrich this data by scraping additional information from other Wikipedia sites
  • clean the data and add population information
  • look at the number of African-American Firsts over time and draw connections to American history
  • analyse where the achievers were born (geospatial analysis)
  • take a critical look at the gender gap
  • see how old the achievers were when breaking the colour barrier

Fasten your seatbelt and let the wild ride begin!


source

The Data

As indicated above, the process of data extraction, enrichment, and cleaning is quite complex and time-consuming. We will execute the following steps:

  • We will first scrape the list of African-American Firsts, that can be found on Wikipedia.
  • After that, we will scrape the individual Wikipedia webpage of each individual achiever to get their birthdays and birth locations.
  • Third, we will use the scraped location information to get geospatial information via forward geocoding.
  • Fourth, we infer the gender of each achiever and the category of the achievement.
  • Finally, we clean up loose ends, load shapefiles for America and enrich this geospatial data by adding population information.

Scraping List of African-American Firsts

In this part, we scrape the complete Wikipedia page containing the list of African-American Firsts. For this, we leverage the rvest library. Even though this part is inspired by tidytuesday, we have completely revised the code, such that only little resemblance is left.

# define URL of the list of African-American Firsts
first_url <- "https://en.wikipedia.org/wiki/List_of_African-American_firsts"


# load complete wikipedia page into R
raw_first <- read_html(first_url)


# function to extract the year of a "first" from the raw HTML code
get_year <- function(id_num){
  
  # parse raw HTML code to extract the year of the "first"
  raw_first %>% 
    html_nodes(glue::glue("#mw-content-text > div > h4:nth-child({id_num}) > span.mw-headline")) %>% 
    html_attr("id")
  
}


# function to extract the complete line / entry of each "first" from the raw HTML code
get_first <- function(id_num){
  
  # parse raw HTML code to extract the line / entry of the "first"
  raw_first %>% 
    html_nodes(glue::glue("#mw-content-text > div > ul:nth-child({id_num}) > li")) %>% 
    # store multiple "first" per year in a list
    lapply(function(x) x)
  
}


# find years and complete lines / entries of the "firsts" in the scraped webpage
raw_first_df <- tibble(id_num = 1:409) %>% 
  mutate(
         # year contains the year of the "firsts"
         year = map(id_num, get_year),
         # data contains the raw lines / entries of the "firsts" 
         # we have one list of lines / entries per each year
         data = map(id_num, get_first)) %>% 
  
  # convert year to integer
  mutate(year = as.integer(year)) %>% 
  
  # fill empty year cells with the last existing value for year
  fill(year) %>% 
  
  # give each raw line / entry of a "first" its own row, i.e.
  # unnest the list of entries into separate rows
  unnest(data)


# function to extract the link to the wikipedia page of the person, that has achieved the "first"
extract_website <- function(x) {
  
  # parse raw line / entry of the "first" to extract the wikipedia link
  x %>% 
    str_replace(":.*?<a", ": <a") %>% 
    str_extract(": <a href=\\\".*?\\\"") %>% 
    str_extract("/wiki.*?\\\"") %>% 
    str_replace("\\\"", "")
  
}


# function to extract the concrete description of the "first"
extract_first <- function(x) {
  
  # parse raw line / entry of the "first" to extract the description of the "first"
  x %>% 
    str_extract("^First.*?:") %>% 
    str_replace(":", "")
  
}


# function to extract the name of the person, that has achieved the "first"
extract_name <- function(x) {
  
  # parse raw line / entry of the "first" to extract the name of the achiever
  x %>% 
    str_replace(":.*?<a", ": <a") %>% 
    str_extract(": <a href=\\\".*?\\\">") %>% 
    str_extract("title=.*\\\">") %>% 
    str_replace("title=\\\"", "") %>% 
    str_replace("\\\">", "")
  
}


# extract wikipedia links, names and description of "firsts"
clean_first <- raw_first_df %>%
  mutate(
         # get raw html string (with tags) for the complete lines of the "firsts" 
         data_string_raw = map_chr(data, toString),
         # get only html text (without tags) for the complete lines of the "firsts
         data_string_cle = map_chr(data, html_text)) %>% 
  
  mutate(
         # get the link to the wikipedia page of the person, that has achieved the "first"
         wiki = map_chr(data_string_raw, extract_website),
         # get the concrete description of the "first"
         first = map_chr(data_string_cle, extract_first),
         # get the name of the person, that has achieved the "first"
         name = map_chr(data_string_raw, extract_name)) %>% 
  # drop rows where some information is missing
  drop_na()


# clean up
rm(extract_first, extract_name, extract_website, 
   get_first, get_year, 
   raw_first, raw_first_df, first_url)

WOW! That was quite some parsing and wrangling with the HTML. However, this was only the beginning!
We do not yet have too much information.. We basically only know the year of the achievement, the name of the achiever, what the person did and a link to the wikipedia page of that person. What we would be really interested in is where the achiever was born (in order to perform geospatial analyses) and when the achiever was born (in order to analyse how old the person was when achieving the “first”).

Well, let’s move on and collect this information in the next section!

Scraping Individual Wikipedia Pages of the Achievers

In this section, we will scrape the individual Wikipedia webpages of each individual achiever. Thanks to our good work in the last section, we know the link to the Wikipedia page of each individual achiever!

We scrape the Wikipedia page of each individual achiever in order to find out their birth date and their location of birth.

# function to get the location and birthday of a person given a 
# link to the wikipedia page of that person
extract_bday_location <- function(wiki){
  
  # read complete wikipedia page of a person
  html <- read_html(paste0("https://en.wikipedia.org", wiki))
  
  
  # extract birth date by parsing the raw HTML
  
  # if we are lucky, wikipedia tells us exactly where
  # to find the birthday
  bd <- html %>% 
    html_node(glue::glue('span[class="bday"]')) %>% 
    html_text()
  
  # if not, we have to do some advanced HTML parsing to 
  # find the birthday
  if(is.na(bd)){
    
    bd <- html %>% 
      html_node("table.vcard") %>% 
      toString() %>% 
      str_replace_all("\\n", "") %>% 
      str_extract("Born.*?</td>") %>% 
      str_extract("<td>.*?<") %>% 
      str_replace("<td>", "") %>% 
      str_replace("<", "") %>% 
      str_replace("\\(.*", "") %>% 
      str_trim() %>% 
      str_replace_all("[^[:alnum:] ]", "") %>% 
      # convert parsed birthday string to a date
      parse_date(approx = TRUE)
    
  }
  
  # finally we convert the bday to a string
  if(!is.na(bd)){
    bd <- toString(bd)
  }
  
  
  # extract birth location by parsing the raw HTML
  
  # parse raw HTML to find birth location
  lo <- html %>% 
    html_node("table.vcard") %>% 
    toString() %>% 
    str_replace_all("\\n", "") %>% 
    str_extract("Born.*?</td>") %>% 
    str_replace("Born</th>", "")
  
  # handling of edge cases
  if(length(str_locate_all(lo, "<br>")[[1]]) == 4){
    lo <- str_replace(lo, "<br>", "")
  }
  
  lo <- lo %>% str_extract("<br><.*?</td>$")
  
  if(!is.na(lo)){
    lo <- lo %>% 
      read_html() %>% 
      html_text()
  }
  
  
  # return bday and location of birth as a list
  return(list(bd, lo))
  
}


# extract birthday and location and store them in new columns
clean_first_augmented <- suppressMessages(clean_first %>% 
                                         # extract birthday and location from wikipedia
                                         mutate(combi = map(wiki, extract_bday_location)) %>% 
                                         # unnest birthday and location into separate columns
                                         unnest_wider(combi) %>% 
                                         # rename new columns
                                         rename(bday = `...1`, location = `...2`) %>% 
                                         # convert bday to character (from list type)
                                         mutate(bday = map_chr(bday, function(x) x))) %>% 
  
  # it is possible that there was a mistake with the extraction of the birthday 
  # --> delete the wrong birthday in such cases
  mutate(bday = ifelse(year(bday) == 2020, NA_character_, bday))


# clean up
rm(clean_first, extract_bday_location)

Nice! We have successfully enriched our dataset with birthdays and locations. However, the locations alone are not really helpful. To do some geospatial analyses, we need the geocodes. Let’s move on!

Forward Geocoding of Locations

To be able to visualise our findings using maps, we need the geolocations. Hence, we use the opencage package to get lng/lat of the location of birth for each individual achiever:

# wrapper function of `opencage_forward` to handle missing locations
opencage_custom <- function(x) {
  
  # if there is no location, return NA
  if(is.na(x)){
    return(NA_character_)
  }
  
  # otherwise find geolocation with opencage
  else{
    return(opencage_forward(x, limit = 1))
  }
  
}


# find geolocation for each individual achiever
clean_first_augmented <- suppressMessages(clean_first_augmented %>% 
                                            
                                         # get information from opencage
                                         mutate(location_geo = map(location, opencage_custom)) %>% 
                                         
                                         # parse and clean the information from opencage
                                         unnest_wider(location_geo) %>% 
                                         unnest(results, keep_empty = TRUE) %>% 
                                         rename(lat = geometry.lat,
                                                lng = geometry.lng,
                                                country = components.country,
                                                state = components.state,
                                                county = components.county,
                                                city = components.city,
                                                FIPS_state = annotations.FIPS.state) %>% 
                                         select(id_num, year, data_string_raw, data_string_cle, 
                                                wiki, first, name, bday, 
                                                location, lat, lng, country, state, county, city, FIPS_state))


# clean up
rm(opencage_custom)

That went smoothly! We are now able to use lng/lat to produce beautiful maps! However, we are still not 100% happy with our data. It would be really great to also have information about the field of the achievement and the gender of the achiever. We will infer both variables in the next section.

Infer Gender and Category

For our analysis, we want to know which kind of “first” was achieved, i.e. we want to map each “first” into a category.
Additionally, we want to know which sex the achievers have.

We will start by getting a category variable. To map each “first” into a category, we define words that are indicators for specific categories. If we find such a word in the description of a “first”, we categorize this “first” accordingly:

# define indicator words for each category

edu <- c(
  "practice", "graduate", "learning", "college", "university", "medicine",
  "earn", "ph.d.", "professor", "teacher", "school", "nobel", "invent", "patent",
  "medicine", "degree", "doctor", "medical", "nurse", "physician", "m.d.", "b.a.", "b.s.", "m.b.a",
  "principal", "space", "astronaut", "scientific") %>% 
  paste0(collapse = "|")

religion <- c("bishop", "rabbi", "minister", "church", "priest", "pastor", "missionary",
              "denomination", "jesus", "jesuits", "diocese", "buddhis", "cardinal") %>%
  paste0(collapse = "|")

politics <- c(
  "diplomat", "elected", "nominee", "supreme court", "legislature", "mayor", "governor",
  "vice President", "president", "representatives", "political", "department", "peace prize",
  "ambassador", "government", "white house", "postal", "federal", "union", "trade",
  "delegate", "alder", "solicitor", "senator", "intelligience", "combat", "commissioner",
  "state", "first lady", "cabinet", "advisor", "guard", "coast", "secretary", "senate",
  "house", "agency", "staff", "national committee", "lie in honor") %>%
  paste0(collapse = "|")

sports <- c(
  "baseball", "football", "basketball", "hockey", "golf", "tennis",
  "championship", "boxing", "games", "medal", "game", "sport", "olympic", "nascar",
  "coach", "trophy", "nba", "nhl", "nfl", "mlb", "stanley cup", "jockey", "pga",
  "race", "driver", "ufc", "champion", "highest finishing position") %>%
  paste0(collapse = "|")

military <- c(
  "serve", "military", "enlist", "officer", "army", "marine", "naval",
  "officer", "captain", "command", "admiral", "prison", "navy", "general",
  "force") %>%
  paste0(collapse = "|")

law <- c("american bar", "lawyer", "police", "judge", "attorney", "law", 
         "agent", "fbi") %>%
  paste0(collapse = "|")

arts <- c(
  "opera", "sing", "perform", "music", "billboard", "oscar", "television",
  "movie", "network", "tony award", "paint", "author", "book", "academy award", "curator",
  "director", "publish", "novel", "grammy", "emmy", "smithsonian",
  "conduct", "picture", "pulitzer", "channel", "villain", "cartoon", "tv", "golden globe",
  "comic", "magazine", "superhero", "pulitzer", "dancer", "opry", "rock and roll", "radio",
  "record") %>%
  paste0(collapse = "|")

social <- c("community", "freemasons", "vote", "voting", "rights", "signature", 
            "royal", "ceo", "community", "movement", "invited", "greek", "million",
            "billion", "attendant", "chess", "pilot", "playboy", "own", "daughter",
            "coin", "dollar", "stamp", "niagara", "pharmacist",
            "stock", "north pole", "reporter", "sail around the world", "sail solo around the world", "press", "miss ",
            "everest")  %>%
  paste0(collapse = "|")


# categorize "firsts" by looking for indicator words in the description
first_df <- clean_first_augmented %>% 
  mutate(category = case_when(
    str_detect(tolower(first), military) ~ "Military",
    str_detect(tolower(first), law) ~ "Law",
    str_detect(tolower(first), arts) ~ "Arts & Entertainment",
    str_detect(tolower(first), social) ~ "Social & Jobs",
    str_detect(tolower(first), religion) ~ "Religion",
    str_detect(tolower(first), edu) ~ "Education & Science",
    str_detect(tolower(first), politics) ~ "Politics",
    str_detect(tolower(first), sports) ~ "Sports",
    TRUE ~ NA_character_
  )) %>% 
  rename(accomplishment = first)


# clean up
rm(arts, edu, law, first_url, military, politics, religion, social, sports,
   clean_first_augmented)

Next, we try to infer the sex of a person, that has achieved a “first”. To do this, we both look at the description of the first and look for indicators like “she”, “women” or “man” and use the gender package to infer sex:

# add gender

# see if we find words in the description of the "first", that identify gender
first_df <- first_df %>% 
  mutate(gender = if_else(str_detect(data_string_cle, 
                                     "\\swoman\\s|\\sWoman\\s|\\sher\\s|\\sshe\\s|\\sfemale\\s"), 
                          "female", 
                  if_else(str_detect(data_string_cle, 
                                     "\\sman\\s|\\sMan\\s|\\shim\\s|\\she\\s|\\smale\\s"), 
                          "male", 
                          "idk")))


# use gender package as second source of info (parse first name)
# input: full name and the year of the "first"
get_gender <- function(name, year){
  
  # get first name
  name <- strsplit(name, split = " ")[[1]][1]
  
  # define right method (see ?gender)
  method = ifelse(year < 1930, "ipums", "ssa")
  
  # get the gender
  ret <- gender(name, method = method, countries = "United States") %>% 
    select(gender) %>% 
    pull()
  
  if(typeof(ret) == "logical"){
    return(NA_character_)
  }
  else{
    return(ret)
  }
  
}


# build final "gender" column
first_df <- first_df %>% 
  # use first name and year to infer gender
  mutate(gender_2 = map2(name, year, get_gender)) %>% 
  # convert to character
  mutate(gender_2 = map_chr(gender_2, function(x) x)) %>%
  # combine both gender columns into one final column
  mutate(gender = if_else(gender != "idk", gender, gender_2)) %>% 
  select(-gender_2)


# clean up
rm(get_gender)

Great! We can finally say:


source

Let us save this part of our work as a csv:

write_csv(first_df, path = here("../data/firsts_augmented.csv"))

In the next section, we can now calculate variables like age and load shapefiles for the mapping.

Data Transformation and Shapefile Loading

Load and Transform “Firsts”

Let us load our scraped and curated data and have a look at it:

# load "firsts" data and clean column names
firsts <- read_csv(here("../data/firsts_augmented.csv"), 
                   col_types = cols(year = col_integer(), 
                                    id_num = col_integer())) %>% 
  clean_names()

# glimpse at data
glimpse(firsts)
## Rows: 524
## Columns: 18
## $ id_num          <int> 12, 14, 18, 20, 20, 22, 24, 29, 31, 33, 35, 37, 46,...
## $ year            <int> 1746, 1760, 1768, 1773, 1773, 1775, 1778, 1783, 178...
## $ data_string_raw <chr> "<li>First known African-American (and slave) to co...
## $ data_string_cle <chr> "First known African-American (and slave) to compos...
## $ wiki            <chr> "/wiki/Lucy_Terry", "/wiki/Jupiter_Hammon", "/wiki/...
## $ accomplishment  <chr> "First known African-American (and slave) to compos...
## $ name            <chr> "Lucy Terry", "Jupiter Hammon", "Wentworth Cheswell...
## $ category        <chr> "Social & Jobs", "Arts & Entertainment", "Social & ...
## $ bday            <date> NA, NA, 1746-04-11, NA, NA, NA, NA, NA, 1753-07-18...
## $ gender          <chr> "female", "male", "male", "female", "female", "male...
## $ location        <chr> "Africa", NA, "Newmarket, New Hampshire", "(likely ...
## $ lat             <dbl> 11.50243, NA, 43.08293, NA, NA, NA, NA, NA, 40.8725...
## $ lng             <dbl> 17.75781, NA, -70.93597, NA, NA, NA, NA, NA, -97.74...
## $ country         <chr> NA, NA, "United States of America", NA, NA, NA, NA,...
## $ state           <chr> NA, NA, "New Hampshire", NA, NA, NA, NA, NA, "Nebra...
## $ county          <chr> NA, NA, "Rockingham County", NA, NA, NA, NA, NA, NA...
## $ city            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ fips_state      <chr> NA, NA, "33", NA, NA, NA, NA, NA, "31", NA, NA, "10...

As we took care of cleaning the data while scraping, there is no much work left to do. We can see that we have quite some NAs. This is because we were not always able to scrape a bday or geocode a location. However, this shouldn’t be a big issue for our analysis and unfortunately there is nothing we can do about it.

We do not need the columns id_num, data_string_raw, data_string_cle, and wiki anymore. Hence, we can drop them:

firsts <- firsts %>% 
  select(-data_string_raw, 
         -data_string_cle,
         -wiki, 
         -id_num)

For our analysis, we need a binned year variable and a variable age, that measures how old a person was when achieving the “first”. We create them in the following:

# cut year into buckets 
# needed for gganimate and the timeline
firsts <- firsts %>% 
  mutate(year_bins = cut(year, 
                        breaks = c(min(year)-1, c(seq(1790, 2020, 10))), 
                        labels = c(seq(1790, 2020, 10)))) %>% 
  mutate(year_bins = as.integer(levels(year_bins))[year_bins]) %>% 
  mutate(year_bins_2 = cut(year, 
                        breaks = c(min(year)-1, c(seq(1760, 2020, 5))), 
                        labels = c(seq(1760, 2020, 5)))) %>% 
  mutate(year_bins_2 = as.integer(levels(year_bins_2))[year_bins_2])


# calculate age when person achieved the "first"
# also cut age into buckets
firsts <- firsts %>% 
  mutate(age = year - year(bday)) %>% 
  mutate(age_bins = cut(age, 
                        breaks = c(seq(0, 100, 10)), 
                        labels = c(seq(0, 90, 10)),
                        right  = FALSE)) %>% 
  mutate(age_bins = as.integer(levels(age_bins))[age_bins])

Great, that should be it! We conclude this section by describing the resulting tibble:

Data Dictionary

variable class description
year integer Year of the achievement
accomplishment character Description of the actual achievement or attainment
name character The person who accomplished the specific accomplishment
category character A few meta-categories of different accomplishments
bday date Birthday of the achiever
gender character Gender of the achiever
location character Location of birth of the achiever
lat double Location of birth of the achiever: Longitude
lng double Location of birth of the achiever: Latitude
country character Location of birth of the achiever: Country
state character Location of birth of the achiever: State
county character Location of birth of the achiever: County
city character Location of birth of the achiever: City
fips_state character Location of birth of the achiever: FIPS of State
year_bins integer Binned year of the achievement
age double Age of person when the “first” was achieved
age_bins integer Binned age

Load and Transform Shapefiles

To be able to plot our data on a map, we need the shapefiles of the US and we have to convert the firsts to a compatible format. For this, we leverage the sf package and the urbnmapr package:

# load states shapefile
states_sf <- get_urbn_map("states", sf = TRUE)

# transfrom geometry to 4326, or pairs of latitude/longitude numbers
states_sf <-  states_sf %>% 
  st_transform(4326) # transfrom to WGS84, latitude/longitude


# load counties shapefile
counties_sf <- get_urbn_map("counties", sf = TRUE)

# transfrom geometry to 4326, or pairs of latitude/longitude numbers
counties_sf <-  counties_sf %>% 
  st_transform(4326) # transfrom to WGS84, latitude/longitude


# convert firsts to a sf object with jitter on coordinates
set.seed(100)
firsts_jitter_sf <- firsts %>% 
  
  # drop rows with missing values
  drop_na(lng, lat, gender) %>%
  
  # jitter points such that they are better visible
  mutate(lng = jitter(lng, amount = 1),
         lat = jitter(lat, amount = 1)) %>% 
  
  # filter for valid locations in the US
  filter(country == "United States of America") %>% 
  filter(lng > -140) %>% 
  
  # convert to sf object
  st_as_sf(coords = c('lng', 'lat'), 
           crs = st_crs(states_sf))

Note that we have jittered the coordinates of the birth locations, as there are for example a lot of locations pointing to New York City. Using jitter makes the maps we will produce much more insightful, as points are not overlapping that much anymore.

What is also important for our analysis are actual population numbers. Because at first sight, the geospatial distribution of firsts might seem odd. However, taking actual population data into account might give a much clearer picture. Hence, we load population data from 1790-2010:

# load absolute population data from 1790-2010 and clean names
# source: https://conservancy.umn.edu/handle/11299/181605
population_abs <- read_xlsx(here("../data/county2010_hist_pops.xlsx"), sheet = "c2010_hist_pops") %>% 
  clean_names()

# bring data into long format and clean the year
population_abs <- population_abs %>% 
  
  # bring to long format
  pivot_longer(cols = epop1790:pop2010, names_to = "year", values_to = "pop") %>% 
  
  # extract / clean year
  mutate(year = str_sub(year, -4, -1)) %>% 
  
  # only take relevant columns
  select(geoid10, year, pop)


# load population density data from 1790-2010 and clean names
# source: https://conservancy.umn.edu/handle/11299/181605
population_dens <- read_xlsx(here("../data/county2010_hist_pops.xlsx"), sheet = "densities") %>% 
  clean_names()

# bring data into long format and clean the year
population_dens <- population_dens %>% 
  
  # bring to long format
  pivot_longer(cols = dens1790:dens2010, names_to = "year", values_to = "dens") %>% 
  
  # extract / clean year
  mutate(year = str_sub(year, -4, -1))


# join population densities and absolute numbers in one table
population <- population_dens %>% 
  left_join(population_abs) %>% 
  
  # cut density into buckets (otherwise we have some with values of over 2000, 
  # and others with values < 2 --> not good for visualisation / colouring)
  mutate(dens_2 = cut(dens, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE)) %>% 
  
  # convert year to integer
  mutate(year = as.integer(year))


# we only have population data until 2010. However, we also have "firsts" in the period of 2010-2020.
# to be able to animate this properly with gganimate, we will duplicate the 2010 values and set them
# as values for 2020. We will end up with population data for 1790 until 2020.
# To say it clear: we assume that there are no changes in population from 2010 to 2020.

pop_2020 <- population %>% 
  filter(year == 2010) %>% 
  mutate(year = 2020)

population <- bind_rows(population, pop_2020) %>% 
  arrange(geoid10, year)


# join counties_sf and population data
counties_pop_sf <- counties_sf %>% 
  left_join(population, by = c("county_fips" = "geoid10"))


# clean up
rm(population, population_abs, population_dens, pop_2020)

We end up having a ready-to-plot shapefile of “firsts”, a shapefile to plot US states and a shapefile to plot US counties. Additionally, we have a big shapefile counties_pop_sf, that holds information about the population in each US county since 1790!

Unbelievable, this was the last part of the data scraping and wrangling part! We are ready to go on now to visualise the data and tell our story!

Let’s say goodbye to this technically demanding part of our work:


source

The Story

“Firsts” over Time and a Stroll through American History

firsts %>% 
  # count number of "firsts" in each bin
  count(year_bins_2) %>% 
  
  # initiate ggplot
  ggplot(aes(x = year_bins_2, y = n)) +
  # plot "firsts" over time as blue line
  geom_line(size = 1, color = "#04314D") + 
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  
  # add vertical golden line for Civil War
  annotate("segment", 
           x = 1861 , 
           y = -Inf, 
           xend = 1861, 
           yend = 18, 
           color = "#D6AF00", 
           size = 1) +
  # add text for Civil War
  geom_text(data = data.frame(x = 1845, y = 16.5, label = "Civil War"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
  ) +
  
  # add labels
  labs(title = "A Stroll through American History",
       subtitle = "Number of \"Firsts\" over Time") +
  NULL

firsts %>% 
  # count number of "firsts" in each bin
  count(year_bins_2) %>%
  
  # initiate ggplot
  ggplot(aes(x = year_bins_2, y = n)) +
  # plot "firsts" over time as blue line
  geom_line(size = 1, color = "#04314D") + 
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  
  # add vertical golden line for Civil War
  annotate("segment", 
           x = 1861 , 
           y = -Inf, 
           xend = 1861, 
           yend = 18, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for Civil War
  geom_text(data = data.frame(x = 1845, y = 16.5, label = "Civil War"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add vertical golden line for WWII
  annotate("segment", 
           x = 1941 , 
           y = -Inf, 
           xend = 1941, 
           yend = 28, 
           color = "#D6AF00", 
           size = 1) +
  # add text for WWII
  geom_text(data = data.frame(x = 1920, y = 26.5, label = "World War II"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE
  ) +
  
  # add labels
  labs(title = "A Stroll through American History",
       subtitle = "Number of \"Firsts\" over Time") +
  NULL

firsts %>% 
  # count number of "firsts" in each bin
  count(year_bins_2) %>%
  
  # initiate ggplot
  ggplot(aes(x = year_bins_2, y = n)) +
  # plot "firsts" over time as blue line
  geom_line(size = 1, color = "#04314D") + 
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  
  # add vertical golden line for Civil War
  annotate("segment", 
           x = 1861 , 
           y = -Inf, 
           xend = 1861, 
           yend = 18, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for Civil War
  geom_text(data = data.frame(x = 1845, y = 16.5, label = "Civil War"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add vertical golden line for WWII
  annotate("segment", 
           x = 1941, 
           y = -Inf, 
           xend = 1941, 
           yend = 28, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for WWII
  geom_text(data = data.frame(x = 1920, y = 26.5, label = "World War II"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add line for I Have a Dream
  annotate("segment", 
           x = 1963 , 
           y = -Inf, 
           xend = 1963, 
           yend = 38, 
           color = "#D6AF00", 
           size = 1) +
  # add text for I Have a Dream
  geom_text(data = data.frame(x = 1905, y = 36.5, label = "March on Washington: I Have a Dream"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE
  ) +
  
  # add labels
  labs(title = "A Stroll through American History",
       subtitle = "Number of \"Firsts\" over Time") +
  NULL

firsts %>% 
  # count number of "firsts" in each bin
  count(year_bins_2) %>%
  
  # initiate ggplot
  ggplot(aes(x = year_bins_2, y = n)) +
  # plot "firsts" over time as blue line
  geom_line(size = 1, color = "#04314D") + 
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  
  # add vertical golden line for Civil War
  annotate("segment", 
           x = 1861 , 
           y = -Inf, 
           xend = 1861, 
           yend = 18, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for Civil War
  geom_text(data = data.frame(x = 1845, y = 16.5, label = "Civil War"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add vertical golden line for WWII
  annotate("segment", 
           x = 1941, 
           y = -Inf, 
           xend = 1941, 
           yend = 28, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for WWII
  geom_text(data = data.frame(x = 1920, y = 26.5, label = "World War II"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add line for I Have a Dream
  annotate("segment", 
           x = 1963 , 
           y = -Inf, 
           xend = 1963, 
           yend = 38, 
           color = "#D6AF00", 
           size = 1, 
           alpha = 0.2) +
  # add text for I Have a Dream
  geom_text(data = data.frame(x = 1905, y = 36.5, label = "March on Washington: I Have a Dream"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE,
            alpha = 0.2
  ) +
  
  # add line for Barack Obama
  annotate("segment", 
           x = 2008 , 
           y = -Inf, 
           xend = 2008, 
           yend = 48, 
           color = "#D6AF00", 
           size = 1) +
  # add text for Barack Obama
  geom_text(data = data.frame(x = 1952, y = 46.5, label = "First Black President: Barack Obama"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE
  ) +
  
  # add labels
  labs(title = "A Stroll through American History",
       subtitle = "Number of \"Firsts\" over Time") +
  NULL

firsts %>% 
  # count number of "firsts" in each bin
  count(year_bins_2) %>%
  
  # initiate ggplot
  ggplot(aes(x = year_bins_2, y = n)) +
  # plot "firsts" over time as blue line
  geom_line(size = 1, color = "#04314D") + 
  
  # add LOESS model line
  geom_smooth(method = "loess", 
              se = FALSE,
              color = "#D6AF00") + 
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  
  # add labels
  labs(title = "When will be the next Boost in \"Firsts\"?",
       subtitle = "Number of \"Firsts\" over Time") +
  
  # add arrow to left plateau
  geom_curve(data = data.frame(x = 1843, y = 33, xend = 1882, yend = 8.4),
             mapping = aes(x = x, y = y, xend = xend, yend = yend),
             colour = "black",
             curvature = 0,
             arrow = arrow(length = unit(2, "mm"), type = "closed"),
             inherit.aes = FALSE) +
  
  # add arrow to right plateau
  geom_curve(data = data.frame(x = 1845, y = 34, xend = 1995, yend = 26),
             mapping = aes(x = x, y = y, xend = xend, yend = yend),
             colour = "black",
             curvature = 0,
             arrow = arrow(length = unit(2, "mm"), type = "closed"),
             inherit.aes = FALSE) +
  
  # add text
  geom_text(data = data.frame(x = 1821.5, y = 34.7, label = "Two Plateaus"),
            aes(x = x, y = y, label = label),
            colour = "black",
            family = "Lato",
            inherit.aes = FALSE
  ) +
  NULL

We can see that there is basically a boost all 100 years!

Geographical Distribution of “Firsts”

# change names for gganimate to work
firsts_jitter_sf_animate <- firsts_jitter_sf %>% 
  rename(year = year_bins, year_full = year)

# visualise population density and locations of "firsts" over time
p <- ggplot() +
  
  # draw polygons from counties shapefile,
  # fill with population density
  geom_sf(data = counties_pop_sf, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(breaks = levels(counties_pop_sf$dens_2),
                    values = c("#e9eff2", 
                               "#93b4c4", 
                               "#57859c", 
                               "#2d627d", 
                               "#114966", 
                               "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_animate, 
    aes(fill = gender), 
    size = 9, 
    shape = 21,
    show.legend = TRUE,
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "Year: {previous_frame} - {current_frame}",
       title = "Location of \"Firsts\" and Population Density over Time",
       fill  = "Gender") + 
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(size = 28, 
                            family = "Lato", 
                            colour = "black"),
        plot.subtitle = element_text(size = 32, 
                            family = "Lato Black", 
                            colour = "black"),
        plot.title = element_text(size = 36, 
                            family = "Lato", 
                            colour = "black")) +
  
  # animate the "firsts" and pop density over the years
  transition_manual(year) + 
  NULL


# render gif and save it
# animate(p, fps = 2, height = 800, width = 1600)
# anim_save("firsts_over_time.gif", animation = last_animation())

## Filter for data before Civil War:
firsts_jitter_sf_filtered <- firsts_jitter_sf %>% 
  filter(year <= 1860)

# compute relevant average population density
counties_pop_sf_filtered <- counties_pop_sf %>% 
  filter(year <= 1860) %>% 
  
  # calculate average population density
  group_by(county_fips) %>% 
  summarise(dens_agg = mean(dens), geometry = first(geometry)) %>% 
  
  # cut into bins as in the data transformation part
  mutate(dens_2 = cut(dens_agg, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens_agg)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE))


ggplot() +
  # draw polygons from counties shapefile
  # fill with population density
  geom_sf(data = counties_pop_sf_filtered, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(values = c("#e9eff2", "#93b4c4", "#57859c", "#2d627d", "#114966", "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_filtered, 
    aes(fill = gender), 
    size = 3, 
    shape = 21,
    show.legend = TRUE, 
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "before 1861",
       title = "Location of \"Firsts\" and Population Density",
       fill  = "Gender") +
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

## Filter for data before WWII:
firsts_jitter_sf_filtered <- firsts_jitter_sf %>% 
  filter(year >= 1861, year < 1941)

# compute relevant average population density
counties_pop_sf_filtered <- counties_pop_sf %>% 
  filter(year <= 1940, year >= 1860) %>% 
  
  # calculate average population density
  group_by(county_fips) %>% 
  summarise(dens_agg = mean(dens), geometry = first(geometry)) %>% 
  
  # cut into bins as in the data transformation part
  mutate(dens_2 = cut(dens_agg, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens_agg)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE))


ggplot() +
  # draw polygons from counties shapefile
  # fill with population density
  geom_sf(data = counties_pop_sf_filtered, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(values = c("#e9eff2", "#93b4c4", "#57859c", "#2d627d", "#114966", "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_filtered, 
    aes(fill = gender), 
    size = 3, 
    shape = 21,
    show.legend = TRUE, 
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "1861 - 1940",
       title = "Location of \"Firsts\" and Population Density",
       fill  = "Gender") +
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

## Filter for data efore 1963:
firsts_jitter_sf_filtered <- firsts_jitter_sf %>% 
  filter(year >= 1941, year < 1963)

# compute relevant average population density
counties_pop_sf_filtered <- counties_pop_sf %>% 
  filter(year <= 1960, year >= 1940) %>% 
  
  # calculate average population density
  group_by(county_fips) %>% 
  summarise(dens_agg = mean(dens), geometry = first(geometry)) %>% 
  
  # cut into bins as in the data transformation part
  mutate(dens_2 = cut(dens_agg, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens_agg)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE))


ggplot() +
  # draw polygons from counties shapefile
  # fill with population density
  geom_sf(data = counties_pop_sf_filtered, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(values = c("#e9eff2", "#93b4c4", "#57859c", "#2d627d", "#114966", "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_filtered, 
    aes(fill = gender), 
    size = 3, 
    shape = 21,
    show.legend = TRUE, 
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "1941 - 1962",
       title = "Location of \"Firsts\" and Population Density",
       fill  = "Gender") +
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

## Filter for data before 2008:
firsts_jitter_sf_filtered <- firsts_jitter_sf %>% 
  filter(year >= 1963, year < 2008)

# compute relevant average population density
counties_pop_sf_filtered <- counties_pop_sf %>% 
  filter(year <= 2010, year >= 1960) %>% 
  
  # calculate average population density
  group_by(county_fips) %>% 
  summarise(dens_agg = mean(dens), geometry = first(geometry)) %>% 
  
  # cut into bins as in the data transformation part
  mutate(dens_2 = cut(dens_agg, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens_agg)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE))


ggplot() +
  # draw polygons from counties shapefile
  # fill with population density
  geom_sf(data = counties_pop_sf_filtered, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(values = c("#e9eff2", "#93b4c4", "#57859c", "#2d627d", "#114966", "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_filtered, 
    aes(fill = gender), 
    size = 3, 
    shape = 21,
    show.legend = TRUE, 
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "1963 - 2007",
       title = "Location of \"Firsts\" and Population Density",
       fill  = "Gender") +
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

## Filter for data after 2008:
firsts_jitter_sf_filtered <- firsts_jitter_sf %>% 
  filter(year >= 2008)

# compute relevant average population density
counties_pop_sf_filtered <- counties_pop_sf %>% 
  filter(year <= 2020, year >= 2000) %>% 
  
  # calculate average population density
  group_by(county_fips) %>% 
  summarise(dens_agg = mean(dens), geometry = first(geometry)) %>% 
  
  # cut into bins as in the data transformation part
  mutate(dens_2 = cut(dens_agg, 
                      breaks = c(-0.1, 2, 6, 18, 45, 90, max(dens_agg)),
                      labels = c("[0, 2]", 
                                 "(2, 6]", 
                                 "(6, 18]", 
                                 "(18, 45]", 
                                 "(45, 90]", 
                                 "90+"),
                      ordered_result = TRUE))


ggplot() +
  # draw polygons from counties shapefile
  # fill with population density
  geom_sf(data = counties_pop_sf_filtered, aes(fill = dens_2), 
          color = "#367696") +
  # define colours for population density fill
  scale_fill_manual(values = c("#e9eff2", "#93b4c4", "#57859c", "#2d627d", "#114966", "#04314D"),
                    name = "Pop. per sq. mile") +
  
  # enable a new "fill" dimension for gender
  new_scale_fill() + 
  
  # add locations from firsts shapefile
  geom_sf(
    data = firsts_jitter_sf_filtered, 
    aes(fill = gender), 
    size = 3, 
    shape = 21,
    show.legend = TRUE, 
    color = "white"
  ) + 
  # fill by gender
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C")) +
  
  # add labels
  labs(subtitle = "from 2008",
       title = "Location of \"Firsts\" and Population Density",
       fill  = "Gender") +
  
  # customize theme
  theme_minimal()+
  coord_sf(datum = NA) + # remove coordinates
  theme(axis.text = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

going west cultural center –> New York, Boston, LA, ..

Gender Gap over Time

sex_count <- firsts %>%
  filter(!is.na(gender),
         !is.na(year_bins)) %>% 
  group_by(year_bins, gender) %>% 
  count() %>% 
  arrange(year_bins)

# add missing year 1820
sex_count_1820 <- sex_count %>% 
  filter(year_bins == 1800) %>% 
  mutate(year_bins = as.integer(1820),
         n = as.integer(0))

sex_count <- bind_rows(sex_count, sex_count_1820)

rm(sex_count_1820)

sex_count_wider <- sex_count %>% 
  pivot_wider(names_from = gender, values_from = n) %>% 
  mutate(female = if_else(is.na(female), as.integer(0), female))

sex_count <- sex_count_wider %>% 
  pivot_longer(female:male, names_to = "gender", values_to = "n")


sex_count_wider <- sex_count_wider %>% 
  mutate(middle = ifelse(female > male,male,female))



sex_count_wider %>% 
  ggplot(aes(x = year_bins)) +
  geom_line(data = sex_count, 
            aes(x = year_bins, y = n, color = gender),
            size = 1) +
  scale_color_manual(values = c("#D6AF00", "#4D0B0C"), name = "Gender") +
  geom_ribbon(aes(ymin = middle, ymax = male, group = 1), fill = "grey50", alpha = 0.2) +
  geom_ribbon(aes(ymin = middle, ymax = female, group = 1), fill = "grey100", alpha = 0.2) +
  labs(title = "Gender Gap persists over Time",
       subtitle = "Number of \"Firsts\" by Gender",
       x = "Year",
       y = "Number of Firsts",
       caption = "") +
  # customize theme
  theme_minimal() + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black"))

  NULL
## NULL

Although African-Americans have growing achievements over the years, the gender gaps have also been broaden. In the two spikes of 1860s and 1940s, male number grew much more than female, and the difference between sex stays in following years, which can be clearly demonstrated from the shades in the graph. However, entering 21th century, the gender equality has improved a lot.

Analysing Gender and Category

gender_category_df<- firsts %>% 
  filter(!is.na(gender)) %>% 
  filter(!is.na(category)) %>% 
  mutate(category = if_else(category == "Education & Science", "Edu & Science", 
                    if_else(category == "Arts & Entertainment", "Arts & Entertainment", category))) %>% 
  group_by(gender) %>% 
  mutate(number_firsts= n()) %>% 
  ungroup() %>% 
  group_by(gender, category) %>% 
  mutate(number_category_firsts= n(),
         pct_category_firsts=number_category_firsts/ number_firsts) %>% 
  select(gender,category,pct_category_firsts) %>% 
  distinct() %>% 
  arrange(desc(pct_category_firsts)) %>% 
  mutate(category = fct_inorder(category))


gender_category_df %>% 
  ggplot(aes(y = pct_category_firsts, x = 
               reorder(category,desc(pct_category_firsts)),fill=gender)) +
  geom_col(width = 0.7, position='dodge')+
  scale_fill_manual(values = c("#D6AF00", "#4D0B0C"),
                    name = "Gender") +
  geom_text(
    aes(label = ifelse(gender=="male" & pct_category_firsts < 0.1, paste0(round(pct_category_firsts*100,0), "%"),"")), 
    color = "white", 
    size = 4.5,
    vjust = 1.5,
    hjust= -0.45
  ) +
  geom_text(
    aes(label = ifelse(gender=="male" & pct_category_firsts >= 0.1, paste0(round(pct_category_firsts*100,0), "%"),"")), 
    color = "white", 
    size = 4.5,
    vjust = 1.5,
    hjust= -0.15
  ) +
  geom_text(
    aes(label = ifelse(gender=="female" & pct_category_firsts < 0.1, paste0(round(pct_category_firsts*100,0), "%"),"")), 
    color = "white", 
    size = 4.5,
    vjust = 1.5,
    hjust=1.35
  ) +
  geom_text(
    aes(label = ifelse(gender=="female" & pct_category_firsts >= 0.1, paste0(round(pct_category_firsts*100,0), "%"),"")), 
    color = "white", 
    size = 4.5,
    vjust = 1.5,
    hjust=1.12
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent)+
 # scale_x_reordered()+
  labs(title = "African-American Achievements follow stereotypical Gender Roles!",
       subtitle  = "Firsts by Gender and Category")+
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid = element_blank(),
        axis.text.y = element_blank(),
        text = element_text(family = "Lato", 
                            colour = "black",
                            size   = 13),
        plot.title = element_text(family = "Lato", 
                                  colour = "black",
                                  size = 18),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black",
                                     size = 15)) +
  NULL

Looking at the Firsts over Categories for African American Men and Women, we see that there are certain trends: Women’s firsts are most prevalent in the Arts and Entertainment sectors, however they are also present in the Education and Sciences. For Men, the Sciences and Military Sector are pretty strong, as well as Politics. This confirms the alleged gender bias and overrepresentation of males in technical and political spheres wheras women are traditionally pushed into the Arts and Social sectors.

Age Distribution by Category

# pipe data
firsts %>%
  
  # filter na values
  filter(!is.na(age_bins)) %>% 
  
  # group by age and category
  group_by(age_bins, category) %>% 
  
  # count number of individuals for each age interval and category
  count() %>% 
  
  # create bar chart showing age distribution for different categories
  ggplot(aes(x = age_bins, y = n)) +
  geom_col(fill = "#04314D") +
  
  # label graph
  labs(title = "Different \"Firsts\" are typically achieved with different ages!",
     #subtitle = "Categories like sports have relatively young firsts, whereas categories like politics have older firsts",
     subtitle = "Number of \"Firsts\" by Age and Category",
     x = "Age",
     y = "Number of \"Firsts\"") +
  
  # create multi-panel plots by category
  facet_wrap(~category) +
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

From this multi-planel bar graph, we can see that different categories have different age distributions. The firsts in categories like Art & Entertainment, Sports, Education & Science, and Social & Jobs are relatively young, whereas those firsts in categories like Politics and the Military are relatively older. We are unable to make conclusions on the age trends of categories in the religion and law as we do not have too many data points for these respective categories. We can see that the age trends between the Sports and Politics category are very different so let’s dive deeper into these 2 categories:

# pipe data
firsts %>%
  
  # filter na values
  filter(!is.na(age_bins)) %>% 
  
  # group by age and category
  group_by(age_bins, category) %>% 
  
  # filter for 2 categories that show contrasting findings in age intervals
  filter(category == "Sports" | category == "Politics")  %>%
  
  # order Sports before Politics
  mutate(category = factor(category, levels = c("Sports", "Politics"))) %>% 
  
  # count number of individuals for each age interval and category
  count() %>% 
  
  # create bar chart showing age distribution for two categories
  ggplot(aes(x = age_bins, y = n)) +
  geom_col(fill = "#04314D") +
  
  # label graph
  labs(title = "\"Firsts\" in Sports are generally achieved in a younger age than \"Firsts\" in Politics",
     # subtitle = "Firsts in sports are often times in their 25s whereas firsts in politics are often times in their 50s",
     subtitle = "Number of \"Firsts\" by Age and Category",
     x = "Age",
     y = "Number of \"Firsts\"") +
  
  # create multi-panel plots by category
  facet_wrap(~category) +
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black")) +
  NULL

From the zoomed-in comparison of the two categories, we can see that the age distribution between the sports and politics category is very different. This may be because individuals who work in the political sector are generally older than those who work in sports. In addition, the politics sector has also been slightly more hierarchical compared to the sports sector. These two explanations ensure that the findings from the plot is reasonable.

Firsts per State

states <- firsts %>% 
  filter(!is.na(state)) %>% #removing NAs from the data
  count(state) %>% 
  mutate(tot  = sum(n), 
         pct = round(n/sum(n)*100, 1),
         NY = ifelse(`state` == "New York", TRUE, FALSE)) %>%
  filter(pct > 3) %>% 
  arrange(desc(n))

ggplot(states, aes(x =  pct, 
                   # order by counts and fill New York
                   y = reorder(state, pct), fill = NY)) +  
  
  geom_col() + 
  
  scale_fill_manual(values = c("grey80", "#04314D")) + 
  
  scale_x_continuous(expand = expansion(mult = c(0, 0))) +
  
  # adding the percentages on the columns
  geom_text(
    aes(label = paste0(pct, "%")),
    colour = "white",
    size = 4,
    hjust = 1.2,
    family="Lato")+
  
  # adding the comment on new york 
  geom_text(
    data = data.frame(x = 7, y = 3.5, label = "New York outnumbers"),
    aes(x = x, y = y, label = label),
    colour="black",
    family="Lato",
    size = 4,
    hjust = 1,
    lineheight = .6,
    inherit.aes = FALSE)+
  geom_text(
    data = data.frame(x = 6.65, y = 2.7, label = "all other states"),
    aes(x = x, y = y, label = label),
    colour="black",
    family="Lato",
    size = 4,
    hjust = 1,
    lineheight = .6,
    inherit.aes = FALSE)+
  
  # adding the arrow pointing on the New York Column
  geom_curve(
    data = data.frame(x = 6, y = 4, xend = 7.5, yend = 13),
    mapping = aes(x = x, y = y, xend = xend, yend = yend),
    colour = "black",
    size = 0.5,
    curvature = 0.18,
    arrow = arrow(length = unit(2, "mm"), type = "closed"),
    inherit.aes = FALSE) +
  
  # adding titles and subtitles
  labs(title = "The most liberal States take the highest number of \"Firsts\"",
       subtitle = "Percentage of \"Firsts\" per State",
       caption = "") +
  
  # customize theme
  theme_minimal() + 
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title.position = "plot",
        legend.position = "none") +
  theme(text = element_text(family = "Lato", 
                            colour = "black"),
        plot.title = element_text(family = "Lato", 
                                  colour = "black"),
        plot.subtitle = element_text(family = "Lato", 
                                     colour = "black"),
        ) +
  NULL

Due to the growth in population in America being focused on the East Coast until around the 1920s, much earlier than parts for central and western America, it can be seen from the bar chart above that the number of firsts have been significantly higher. This could be due to the migration of communities over time.

The data collected starts from 1800, during which the percentage of “free” blacks was not uniform, the northern eastern region had a much higher percentage of free blacks throughout the 1800s which may have also contributed to the stark difference between the west and east during that time. The ablution of slavery in the 1860s additionally saw a huge impact in the growth of the firsts in African American communities.

New York had the highest count in firsts. New York State was one of the pioneer states to abolish slavery in 1827( It was officially abolished in 1863 in the US), and New York City had one of the largest concentration of free African Americans, with many institutions being established to support the community.